Every moment counts in action recognition. A comprehensive understanding of human activity in video requires labeling every frame according to the actions occurring, placing multiple labels densely over a video sequence. To study this problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new dataset of dense labels over unconstrained internet videos. Modeling multiple, dense labels benefits from temporal relations within and across classes. We define a novel variant of long short-term memory (LSTM) deep networks for modeling these temporal relations via multiple input and output connections. We show that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.
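To make the dense, multi-label setting concrete: unlike single-label classification (one softmax over actions per video), each frame carries independent per-class scores, so several actions can be active simultaneously. The sketch below is a minimal illustration of this idea using a plain single-layer LSTM with per-frame sigmoid outputs; it is not the paper's actual multi-connection architecture, and all names and dimensions are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenseMultiLabelLSTM:
    """Illustrative sketch (not the paper's model): a single LSTM layer
    that emits independent per-class sigmoid scores at every frame,
    so multiple action labels can be active at once."""

    def __init__(self, input_dim, hidden_dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.1
        # Gate weights packed row-wise as [input, forget, cell, output].
        self.W = rng.normal(0, scale, (4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        # Per-class readout: one sigmoid score per action class per frame.
        self.W_out = rng.normal(0, scale, (num_classes, hidden_dim))
        self.b_out = np.zeros(num_classes)
        self.hidden_dim = hidden_dim

    def forward(self, frames):
        """frames: (T, input_dim) per-frame features -> (T, num_classes) scores."""
        H = self.hidden_dim
        h = np.zeros(H)
        c = np.zeros(H)
        scores = []
        for x in frames:
            z = self.W @ np.concatenate([x, h]) + self.b
            i = sigmoid(z[0:H])          # input gate
            f = sigmoid(z[H:2 * H])      # forget gate
            g = np.tanh(z[2 * H:3 * H])  # candidate cell state
            o = sigmoid(z[3 * H:4 * H])  # output gate
            c = f * c + i * g
            h = o * np.tanh(c)
            # Independent sigmoids (no softmax): co-occurring actions allowed.
            scores.append(sigmoid(self.W_out @ h + self.b_out))
        return np.stack(scores)
```

Training such a model would use a per-frame, per-class binary cross-entropy loss, since each frame's labels form a multi-hot vector rather than a single category.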